Introduction

About The Project

INN Hotels Group has a chain of hotels in Portugal, they are facing problems with the high number of booking cancellations. In the hotel industry, effective management of bookings is crucial for optimizing revenue and ensuring operational efficiency.

Our main objective in this project is to analyze the dataset to find which factors have a high influence on booking cancellations while answering some SMART questions. By examining customer attributes such as room preferences, lead time, meal plan choices, and past booking behavior, we will identify key patterns and trends. Additionally, we will also analyze other variables such as special requests, arrival month and market segments.

Exposition

The dataset used in this project, sourced from Kaggle, contains detailed information on hotel bookings. With around 36,000 records from city and resort hotels, it provides insight into the factors contributing to hotel booking cancellations. The dataset includes variables such as lead time (the number of days between booking and arrival), room type, market segment, average daily rate (ADR), and special requests. These variables offer a robust foundation for analyzing cancellation trends and customer behavior.

A key strength of this dataset is its comprehensive scope. It includes not only basic booking details but also behavioral indicators like special requests, allowing for a deeper exploration of how different types of customers behave. This enables us to evaluate factors beyond the usual booking characteristics, such as how the length of stay or lead time influences cancellation likelihood.

It’s worth keeping in mind several limitations that could influence the results of the analysis. First, this dataset lacks demographic data like specific age groups, gender, or income, which are often critical in understanding customer behavior patterns. Without this information, it becomes more difficult to explore how demographic factors could correlate with cancellations. The dataset also only contains bookings from a 2017 to 2018, which limits the ability to understand longer-term trends or assess the impact of other external factors that might have fluctuated across different years, such as economic upturns/downturns, the Covid-19 pandemic, or general industry shifts.

The dataset source does not explicitly state how the data was gathered, but it was likely obtained from a “hotel reservation system” or HRS. This system is designed to automate the booking process, manage room availability, and handle customer data, including booking status, lead time, room type, special requests, and much more. The data is typically stored in the hotel’s “Property Management System” or PMS, which centralizes information from various booking channels such as the hotel’s websitem third party platforms (like Booking.com), and over-the-phone bookings. The PMS ensures that reservation data is synchronized across platforms and stored securely for retrieval during check-in and check-out.(1) Through these systems, hotels reduce human error and increase efficiency, although missing data or occasional discrepancies may still occur due to system errors/limitations, or other external factors.

Prior research into hotel booking cancellations has consistently highlighted the importance of lead time—the period between when a booking is made and the actual check-in date—as a key predictor of cancellations. Studies show that customers who book far in advance tend to cancel more often, likely because they have more time to change their plans. This trend has been confirmed in other hospitality datasets, where longer lead times are correlated with a higher likelihood of cancellations.(2) Additionally, market segmentation plays a role in understanding booking behaviors. Corporate clients and group bookings generally exhibited lower cancellation rates compared to individual leisure travelers. This is likely due to the stricter nature of corporate travel policies and pre-arranged group contracts, which offer less flexibility for last-minute cancellations. On the other hand, bookings made through online travel agents (OTAs) tend to have higher cancellation rates compared to direct bookings.(2)

Research into hotel booking patterns significantly influenced the development of the research questions for this project. Based on our prior findings, the hypothesis was that lead time would be a strong predictor of cancellations, as customers who book earlier have more opportunities to cancel. The decision to focus on variables such as special requests and ADR was similarly guided by studies that suggested these variables would also influence cancellation behavior. Additionally, our assumptions concerning seasonality informed our decision to analyze cancellation patterns over different times of the year.

To improve the current analysis, the inclusion of demographic data such as customer age, gender, and income would offer a more thorough understanding of customer behavior, allowing for better segmentation and tailored strategies to prevent cancellations. Additionally, having detailed information about the cancellation policies—specifically (whether bookings were refundable or non-refundable) would provide crucial insights into how financial incentives influence cancellation decisions. This data could help differentiate between voluntary cancellations and those prompted by policy restrictions. Lastly, data on external factors, such as travel restrictions, promotions, or special events that occurred during the booking period, would also provide importent context, helping to explain possible spikes in cancellations or booking behaviors that may otherwise seem anomalous. This broader dataset would allow for a more comprehensive analysis.

  1. Source: Cloudbeds. (2024). What is a Hotel Reservation System? Retrieved from https://www.cloudbeds.com/articles/hotel-reservation-system/
  2. Manuel Banza. (2022). Predicting Hotel Booking Cancellations Using Machine Learning. Retrieved from HospitalityNet.org

SMART Questions

1. Do customers who make special requests cancel their bookings less frequently than those who don’t?
2. Are guests with no previous cancellations more likely to avoid canceling their current booking?
3. How does the booking status varies upon meal plan, room type, lead time and average room price ?
4. What are the key factors that show the strongest correlation with booking cancellations?
5. How do hotel cancellation rates change across different seasons (spring, fall, winter, summer, holidays, low season), and which factors (e.g., lead time, room type/) correlate most strongly with cancellations during each season?

About The Dataset

For this project, we have selected a hotel booking dataset containing over 36,000 records of bookings. INNHotelsGroup dataset from Kaggle. This dataset offers a comprehensive overview of hotel booking patterns, making it ideal for our analysis on cancellations and no-shows.

hotel_data <- read.csv("../Dataset/INNHotelsGroup.csv")

This dataset contains 36275 observations of 19 variables. Out of these observations, 0 rows contain null values.

str(hotel_data)
## 'data.frame':    36275 obs. of  19 variables:
##  $ Booking_ID                          : chr  "INN00001" "INN00002" "INN00003" "INN00004" ...
##  $ no_of_adults                        : int  2 2 1 2 2 2 2 2 3 2 ...
##  $ no_of_children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_weekend_nights                : int  1 2 2 0 1 0 1 1 0 0 ...
##  $ no_of_week_nights                   : int  2 3 1 2 1 2 3 3 4 5 ...
##  $ type_of_meal_plan                   : chr  "Meal Plan 1" "Not Selected" "Meal Plan 1" "Meal Plan 1" ...
##  $ required_car_parking_space          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ room_type_reserved                  : chr  "Room_Type 1" "Room_Type 1" "Room_Type 1" "Room_Type 1" ...
##  $ lead_time                           : int  224 5 1 211 48 346 34 83 121 44 ...
##  $ arrival_year                        : int  2017 2018 2018 2018 2018 2018 2017 2018 2018 2018 ...
##  $ arrival_month                       : int  10 11 2 5 4 9 10 12 7 10 ...
##  $ arrival_date                        : int  2 6 28 20 11 13 15 26 6 18 ...
##  $ market_segment_type                 : chr  "Offline" "Online" "Online" "Online" ...
##  $ repeated_guest                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_price_per_room                  : num  65 106.7 60 100 94.5 ...
##  $ no_of_special_requests              : int  0 1 0 0 0 1 1 1 1 3 ...
##  $ booking_status                      : chr  "Not_Canceled" "Not_Canceled" "Canceled" "Canceled" ...

Data Dictionary

  • Booking_ID: unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of Children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status: Flag indicating if the booking was canceled or not.

Cleaning The Dataset

For our project, we are interested in majority of the variables but there are some irrelevant columns with respect to our objective and thus we will drop them.

hotel_data_clean <- hotel_data[, c("no_of_adults", "no_of_children", "no_of_weekend_nights", "no_of_week_nights", "type_of_meal_plan", "required_car_parking_space", "room_type_reserved", "lead_time", "arrival_year", "arrival_month",  "market_segment_type", "repeated_guest", "no_of_previous_cancellations", "no_of_special_requests", "booking_status", 
"avg_price_per_room")]

hotel_data_clean <- na.omit(hotel_data_clean)

After cleaning, we are left with 36275 observations of 16 variables.

write.csv(hotel_data_clean,"../Dataset/INNHotelsGroup_min.csv", row.names = F)

Exploratory Data Analysis (EDA)

1. Summary Statistics

hotel_data_clean$type_of_meal_plan <- as.factor(hotel_data_clean$type_of_meal_plan)
hotel_data_clean$room_type_reserved <- as.factor(hotel_data_clean$room_type_reserved)
hotel_data_clean$booking_status <- as.factor(hotel_data_clean$booking_status)
hotel_data_clean$market_segment_type <- as.factor(hotel_data_clean$market_segment_type)
hotel_data_clean$arrival_year <- as.factor(hotel_data_clean$arrival_year)
hotel_data_clean$arrival_month <- as.factor(hotel_data_clean$arrival_month)
summary(hotel_data_clean)
##   no_of_adults  no_of_children  no_of_weekend_nights no_of_week_nights
##  Min.   :0.00   Min.   : 0.00   Min.   :0.00         Min.   : 0.0     
##  1st Qu.:2.00   1st Qu.: 0.00   1st Qu.:0.00         1st Qu.: 1.0     
##  Median :2.00   Median : 0.00   Median :1.00         Median : 2.0     
##  Mean   :1.84   Mean   : 0.11   Mean   :0.81         Mean   : 2.2     
##  3rd Qu.:2.00   3rd Qu.: 0.00   3rd Qu.:2.00         3rd Qu.: 3.0     
##  Max.   :4.00   Max.   :10.00   Max.   :7.00         Max.   :17.0     
##                                                                       
##     type_of_meal_plan required_car_parking_space   room_type_reserved
##  Meal Plan 1 :27835   Min.   :0.000              Room_Type 1:28130   
##  Meal Plan 2 : 3305   1st Qu.:0.000              Room_Type 2:  692   
##  Meal Plan 3 :    5   Median :0.000              Room_Type 3:    7   
##  Not Selected: 5130   Mean   :0.031              Room_Type 4: 6057   
##                       3rd Qu.:0.000              Room_Type 5:  265   
##                       Max.   :1.000              Room_Type 6:  966   
##                                                  Room_Type 7:  158   
##    lead_time   arrival_year arrival_month      market_segment_type
##  Min.   :  0   2017: 6514   10     : 5317   Aviation     :  125   
##  1st Qu.: 17   2018:29761   9      : 4611   Complementary:  391   
##  Median : 57                8      : 3813   Corporate    : 2017   
##  Mean   : 85                6      : 3203   Offline      :10528   
##  3rd Qu.:126                12     : 3021   Online       :23214   
##  Max.   :443                11     : 2980                         
##                             (Other):13330                         
##  repeated_guest  no_of_previous_cancellations no_of_special_requests
##  Min.   :0.000   Min.   : 0.00                Min.   :0.00          
##  1st Qu.:0.000   1st Qu.: 0.00                1st Qu.:0.00          
##  Median :0.000   Median : 0.00                Median :0.00          
##  Mean   :0.026   Mean   : 0.02                Mean   :0.62          
##  3rd Qu.:0.000   3rd Qu.: 0.00                3rd Qu.:1.00          
##  Max.   :1.000   Max.   :13.00                Max.   :5.00          
##                                                                     
##       booking_status  avg_price_per_room
##  Canceled    :11885   Min.   :  0       
##  Not_Canceled:24390   1st Qu.: 80       
##                       Median : 99       
##                       Mean   :103       
##                       3rd Qu.:120       
##                       Max.   :540       
## 

Summary:
The average price per room in the dataset is 103 euros, with a median of 99 euros, but prices can reach up to 540 euros, indicating high-priced outliers. Some entries even show an average price of zero, possibly reflecting promotional deals. Guests typically stay for two weekday nights and one weekend night, with the average number of weekday nights being 2.2 and weekend nights 0.81. Stays can extend to as many as 17 weekday nights. Most bookings involve two adults, and many guests do not bring children. Lead times vary significantly, with an average of 85 days, a median of 57, and some bookings made up to 443 days in advance, suggesting a right-skewed distribution. The data also shows sparse previous cancellations, with an average of just 0.02, and a maximum of 58. Bookings are spread over 2017 and 2018, peaking in August. Additionally, most guests do not make special requests, as the median is zero, although some make up to five requests per booking.

2. Distribution of Booking Status

print(ggplot(hotel_data_clean, aes(x = booking_status)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Booking Status Distribution", x = "Booking Status", y = "Count"))

Summary
The dataset contains 36,275 bookings, with 24,390 classified as “Not Canceled” and 11,885 as “Canceled”. Approximately two-thirds of the bookings were completed, while the remaining one-third were canceled.

3. Analysis of Meal Plan Type

print(ggplot(hotel_data_clean, aes(x = type_of_meal_plan, fill = booking_status)) +
  geom_bar(position = "fill") +
  labs(title = "Cancellation Rate by Meal Plan", x = "Meal Plan", y = "Proportion", fill = "Booking Status"))

Summary   Meal Plan 1 has the highest cancellation rate, where more than half of the bookings are canceled. Meal Plan 2 has a lower cancellation rate, with cancellations and non-cancellations being nearly equal. Meal Plan 3 shows the highest proportion of non-cancelled bookings compared to canceled ones, suggesting customers choosing this meal plan tend to cancel less frequently. For the “Not Selected” category, there is a relatively high cancellation rate, similar to Meal Plan 1.

4. Distribution of Room Types

print(ggplot(hotel_data_clean, aes(x = room_type_reserved, fill = booking_status)) +
  geom_bar(position = "fill") +
  labs(title = "Cancellation Rate by Room Type", x = "Room Type", y = "Proportion", fill = "Booking Status"))

Summary
Room Type 1 and Room Type 6 have the highest cancellation rates, with a large proportion of canceled bookings. Room Type 7 stands out as having the highest proportion of non-cancelled bookings, suggesting it may be a preferred or better-secured type of room. For other room types, the rates are fairly similar, with around 60-70% of bookings not canceled, except for Room Type 1 which has more cancellations.

5. Barplot: Number of Special Requests

print(ggplot(hotel_data_clean, aes(x = factor(no_of_special_requests))) +
  geom_bar(fill = "orange") +
  labs(title = "Number of Special Requests", x = "Number of Special Requests", y = "Count") +
  theme_minimal())

Summary
Most guests make no special requests, with a sharp decline as the number increases. A significant portion makes one request, while two or more requests are increasingly rare.

5. Plot: Market Segment vs Booking Status

contingency_table <- table(hotel_data_clean$market_segment_type, hotel_data_clean$booking_status)

plot_data <- as.data.frame(contingency_table)
colnames(plot_data) <- c("Market_Segment", "Booking_Status", "Count")

print(ggplot(plot_data, aes(x = Market_Segment, y = Count, fill = Booking_Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Booking Status by Market Segment",
       x = "Market Segment",
       y = "Count",
       fill = "Booking Status") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)))

Summary
Online market segment have the highest cancellation count, with a large proportion of canceled bookings of 8475. Offline segment has a cancellation count of 3253 For other market segment, the cancellation rates are too less.

6. Histogram: Lead_time

ggplot(hotel_data_clean, aes(x = lead_time, fill = booking_status)) +
  geom_histogram(binwidth = 10, position = "dodge") +
  labs(
    title = "Lead Time vs Booking Status",
    x = "Lead Time (Days)",
    y = "Number of Bookings",
    fill = "Booking Status"
  ) +
  theme_minimal()

Summary
Here we can see that the booking with short lead times are less cancelled and as the lead time increases there are more booking cancellations. Also, we can see more cancellation happening between lead time form 100-200.

7. Boxplot: avg_price_per_room

ggplot(hotel_data_clean, aes(x = booking_status, y = avg_price_per_room, fill = booking_status)) +
  geom_boxplot() +
  labs(
    title = "Average Room Price vs Booking Status",
    x = "Booking Status",
    y = "Average Room Price"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Summary
The booking with cancelled status has higher median average room price value compared to non-cancelled booking status.This might mean that the booking with her average room price have more chances to get cancelled.

Statistical Analysis

1. Do customers who make special requests cancel their bookings less frequently than those who don’t?

Null Hypothesis (H₀): There is no significant association between special requests and booking status.
Alternative Hypothesis (H₁:)There is significant association between special requests and booking status.

hotel_data_clean <- hotel_data_clean %>%
  mutate(has_special_requests = ifelse(no_of_special_requests > 0, "Yes", "No"),
         is_cancelled = ifelse(booking_status == "Canceled", "Canceled", "Not_Canceled"))

special_requests_table <- table(hotel_data_clean$has_special_requests, hotel_data_clean$is_cancelled)

sum(special_requests_table) > 0
## [1] TRUE
special_requests_test <- chisq.test(special_requests_table)

special_requests_props <- prop.table(special_requests_table, margin = 1)

Contingency Table:

print(special_requests_table)
##      
##       Canceled Not_Canceled
##   No      8545        11232
##   Yes     3340        13158

Chi-Square Test Results:

print(special_requests_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  special_requests_table
## X-squared = 2152, df = 1, p-value <2e-16
print(ggplot(hotel_data_clean, aes(x = has_special_requests, fill = is_cancelled)) +
  geom_bar(position = "fill") +
  labs(title = "Cancellation Rates by Special Requests",
    x = "Has Special Requests", y = "Proportion") +
  theme_minimal())

Summary:
The chi-square test result shows a very small p-value which is much less than the typical significance level of 0.05. Due to this we reject the null hypothesis (H₀) and accept the alternative hypothesis (H₁).

Bookings with special requests have a much lower cancellation rate (20.2%) compared to those without (43.2%). The data strongly suggests that customers who make special requests are more likely to follow through with their bookings and significantly less likely to cancel their bookings.

2. Are guests with no previous cancellations more likely to avoid canceling their current booking?

Null Hypothesis (H₀): There is no association between having previous cancellations and the likelihood of canceling the current booking.
Alternative Hypothesis (H₁): Guests with no previous cancellations are less likely to cancel their current booking.

hotel_data_clean <- hotel_data_clean %>%
  mutate(has_previous_cancellations = ifelse(no_of_previous_cancellations > 0, "Yes", "No"))

previous_cancellations_table <- table(hotel_data_clean$has_previous_cancellations, hotel_data_clean$is_cancelled)
previous_cancellations_test <- chisq.test(previous_cancellations_table)

previous_cancellations_props <- prop.table(previous_cancellations_table, margin = 1)

Contingency Table:

print(previous_cancellations_table)
##      
##       Canceled Not_Canceled
##   No     11869        24068
##   Yes       16          322

Chi-Square Test Results:

print(previous_cancellations_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  previous_cancellations_table
## X-squared = 120, df = 1, p-value <2e-16
print(ggplot(hotel_data_clean, aes(x = has_previous_cancellations, fill = is_cancelled)) +
  geom_bar(position = "fill") +
  labs(title = "Cancellation Rates by Previous Cancellations",
       x = "Has Previous Cancellations", y = "Proportion") +
  theme_minimal())

Summary:
Due to the p-value being below the significance threshold of 0.05, we reject the null hypothesis. This provides strong statistical evidence to support the alternative hypothesis that guests with previous cancellations are less likely to cancel their current booking.

Surprisingly, bookings from guests with previous cancellations have a much lower cancellation rate (4.73%) compared to those without previous cancellations (33.03%). This represents a substantial difference of about 28.3 percentage points.

3. How does the booking status varies upon meal plan, room type, lead time and average room price ?

3.1 Does booking status vary upon lead time?

Null Hypothesis (H₀): There is no association between booking status and type of meal plan.
Alternative Hypothesis (H₁): There is an association between booking status and type of meal plan.

hotel_data_clean$canceled <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)

# Subsetting the data for customers who canceled
df_canceled <- subset(hotel_data_clean, canceled == 1)

# Subsetting the data for customers who did not cancel
df_not_canceled <- subset(hotel_data_clean, canceled == 0)

ttestlead_time <- t.test(df_canceled$lead_time, df_not_canceled$lead_time, 
                       alternative = "two.sided", conf.level = 0.95)
ttestlead_time
## 
##  Welch Two Sample t-test
## 
## data:  df_canceled$lead_time and df_not_canceled$lead_time
## t = 81, df = 16886, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  78.3 82.2
## sample estimates:
## mean of x mean of y 
##     139.2      58.9

Summary:
The t-test analysis reveals a significant difference in lead times between canceled and non-canceled bookings. On average, bookings that were canceled had a lead time of 139.2 days, while non-canceled bookings had a much shorter average lead time of 58.9 days. This difference is highly statistically significant (p-value < 0.05), suggesting that guests with longer lead times are more prone to cancel their reservations.

3.2 Does booking status vary upon Average Room Price?

Null Hypothesis (H₀): There is no significant difference in cancellation rates based on the Average Room Price.
Alternative Hypothesis (H₁): There is a significant difference in cancellation rates based on the Average Room Price.

# Creating a binary variable for cancellation
hotel_data_clean$canceled <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)

# Subsetting the data for customers who canceled
df_canceled <- subset(hotel_data_clean, canceled == 1)

# Subsetting the data for customers who did not cancel
df_not_canceled <- subset(hotel_data_clean, canceled == 0)

# Performing a t-test to compare average price per room between those who canceled and those who didn't
ttest_avg_price_per_room <- t.test(
  df_canceled$avg_price_per_room,
  df_not_canceled$avg_price_per_room,
  alternative = "two.sided",
  conf.level = 0.95
)
print(ttest_avg_price_per_room)
## 
##  Welch Two Sample t-test
## 
## data:  df_canceled$avg_price_per_room and df_not_canceled$avg_price_per_room
## t = 28, df = 25929, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.92 11.39
## sample estimates:
## mean of x mean of y 
##     110.6      99.9

Summary:
With extremely small p-value indicates a statistically significant difference between the average room prices for canceled and non-canceled bookings.The bookings with higher average room prices are more likely to be canceled compared to those with lower prices. Specifically, the average room price for canceled bookings is approximately 10.7 units higher than for non-canceled bookings.

3.3 Does booking status vary upon Meal Plan Type?

Null Hypothesis (H₀): There is no significant difference in cancellation rates based on the type of meal plan.
Alternative Hypothesis (H₁): There is a significant difference in cancellation rates based on the type of meal plan.

hotel_data_clean$type_of_meal_plan <- as.factor(hotel_data_clean$type_of_meal_plan)
hotel_data_clean$booking_status <- as.factor(hotel_data_clean$booking_status)

contingency_table <- table(hotel_data_clean$type_of_meal_plan, hotel_data_clean$booking_status)

chi_test_result <- chisq.test(contingency_table)

print(chi_test_result)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 278, df = 3, p-value <2e-16

Summary:
The Chi-squared test suggests that meal plan type significantly affects whether a booking is canceled or not. The p-value is far below the standard threshold (0.05), meaning the differences observed between meal plans in terms of cancellation rates are highly unlikely to be due to chance. And as seen in EDA, the meal plan 1 has the highest cancellation rate.

3.4 Does booking status vary upon Room Type?

Null Hypothesis (H₀): There is no significant difference in cancellation rates based on the room type.
Alternative Hypothesis (H₁): There is a significant difference in cancellation rates based on the room type.

hotel_data_clean$room_type_reserved <- as.factor(hotel_data_clean$room_type_reserved)
#df$booking_status <- as.factor(df$booking_status)

# Create a contingency table
contingency_table <- table(hotel_data_clean$room_type_reserved, hotel_data_clean$booking_status)

# Perform the Chi-squared test
chi_test_result <- chisq.test(contingency_table)

# Display the result of the test
print(chi_test_result)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 57, df = 6, p-value = 2e-10

Summary:
The statistical analysis shows a significant relationship between the room type and the likelihood of booking cancellation. This insight suggests that certain room types are more prone to cancellations than others. And as we have seen in EDA that Room 1 and Room 6 has the highest cancellation rates, so they are more prone to cancellation.

4. What are the key factors that show the strongest correlation with booking cancellations?

# Convert booking_status to a binary variable
hotel_data_clean$booking_status_binary <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)

# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
  df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}

# Remove columns with zero variance
data_filtered <- remove_zero_variance(select_if(hotel_data_clean, is.numeric))

# Calculate the correlation matrix
cor_data <- cor(data_filtered, use = "complete.obs")

# Visualize the correlation matrix
corrplot(cor_data, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Entire Dataset", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

Summary:
The correlation matrix for the entire dataset reveals several key insights regarding factors associated with booking cancellations. Lead time shows the strongest positive correlation with cancellations (0.44), indicating that bookings made further in advance are more likely to be canceled. This makes sense, since customers have more time to change their plans or reconsider their bookings when the time between reservation and stay is longer. The number of special requests has a negative correlation (-0.25) with cancellations, implying that customers who make more personalized arrangements, such as requesting specific room types or amenities, are generally more committed to their bookings and less likely to cancel. The average price per room shows a weak positive correlation (0.14), suggesting that higher-priced bookings are slightly more prone to cancellation, although this relationship is not very strong. It’s possible that the higher financial commitment associated with more expensive rooms leads some customers to reconsider their bookings. Meanwhile, being a repeated guest has a small negative correlation (-0.11) with cancellations, indicating that loyal customers are slightly less likely to cancel their bookings. Interestingly, factors such as the number of adults, children, weekend nights, and week nights show little to no correlation with cancellations, suggesting that the composition of the travel party and the length of stay do not significantly impact the likelihood of a booking being canceled. In summary, lead time stands out as the most significant predictor of cancellations, while customer loyalty factors such as number of special requests, and being a repeated guest reduce the likelihood of cancellations. Although price plays a role, it is not a major determinant of cancellation behavior.

5. How do hotel cancellation rates change across different seasons (spring, fall, winter, summer, holidays, low season), and which factors (e.g., lead time, room type/) correlate most strongly with cancellations during each season?

In addition to analyzing the overall correlations in the dataset, we decided to perform a seasonality correlation analysis to explore how booking behavior might vary across different times of the year. Since factors like lead time, pricing, and cancellation rates can fluctuate with seasonal trends, understanding how these relationships change across spring, summer, fall, and winter could provide deeper insights into customer behavior and help tailor strategies to minimize cancellations based on seasonal patterns.

Seasonality Analysis:

# Define seasons based on arrival month
hotel_data_clean$season <- case_when(
  hotel_data_clean$arrival_month %in% c(3, 4, 5) ~ "Spring",
  hotel_data_clean$arrival_month %in% c(6, 7, 8) ~ "Summer",
  hotel_data_clean$arrival_month %in% c(9, 10, 11) ~ "Fall",
  hotel_data_clean$arrival_month %in% c(12, 1, 2) ~ "Winter",
  TRUE ~ "Unknown"
)
# Convert booking_status to a binary variable
hotel_data_clean$booking_status_binary <- ifelse(hotel_data_clean$booking_status == "Canceled", 1, 0)

Creating subsets for each season:

# Create subsets for each season
spring_data <- subset(hotel_data_clean, season == "Spring")
summer_data <- subset(hotel_data_clean, season == "Summer")
fall_data <- subset(hotel_data_clean, season == "Fall")
winter_data <- subset(hotel_data_clean, season == "Winter")

Remove columns with zero variance:

# Function to remove columns with zero variance
remove_zero_variance <- function(df) {
  df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}

# Remove zero variance columns for each season
spring_data_filtered <- remove_zero_variance(select_if(spring_data, is.numeric))
summer_data_filtered <- remove_zero_variance(select_if(summer_data, is.numeric))
fall_data_filtered <- remove_zero_variance(select_if(fall_data, is.numeric))
winter_data_filtered <- remove_zero_variance(select_if(winter_data, is.numeric))

Calculate correlation matrices:

# Calculate and visualize correlation for Spring
cor_spring <- cor(spring_data_filtered, use = "complete.obs")
corrplot(cor_spring, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Spring", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

# Calculate and visualize correlation for Summer
cor_summer <- cor(summer_data_filtered, use = "complete.obs")
corrplot(cor_summer, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Summer", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

# Calculate and visualize correlation for Fall
cor_fall <- cor(fall_data_filtered, use = "complete.obs")
corrplot(cor_fall, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Fall", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

# Calculate and visualize correlation for Winter
cor_winter <- cor(winter_data_filtered, use = "complete.obs")
corrplot(cor_winter, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix for Winter", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

Fall Correlation Matrix:
The strongest positive correlation with booking status (0.54) is lead time, suggesting that as the lead time increases, there is a higher chance of the booking being canceled. This makes sense, as bookings made far in advance may have a higher likelihood of being reconsidered or cancelled.
Number of special requests has a slight negative correlation with booking status (-0.23), indicating that bookings with more special requests tend to have a lower likelihood of cancellation.

Summer Correlation Matrix:
Lead time continues to show a strong positive correlation (0.43) with booking status, compared to other variables, meaning that longer lead times are associated with cancellations during the summer as well.
Number of special requests shows a moderate negative correlation (-0.30), reinforcing the idea that bookings with special requests are less likely to be cancelled.
Average price per room has a weak positive correlation (0.22), suggesting that higher-priced rooms might be slightly more likely to be cancelled in the summer.

Spring Correlation Matrix:
The positive correlation with lead time remains somewhat significant at 0.29, consistent with the previous seasons. Longer lead times are associated with a higher likelihood of cancellation.
Special requests have the strongest negative correlation (-0.36) in the spring, indicating that bookings with more special requests are much less likely to be canceled during this season.
Average price per room shows a small positive correlation (0.13) with booking status, indicating a slight tendency for higher-priced bookings to be cancelled.

Winter Correlation Matrix:
Lead time remains a positive trending factor, with a correlation of 0.24. Longer lead times during the winter continue to be associated with higher cancellation rates.
The number of special requests has a negative correlation (-0.14), suggesting that special requests still reduce the likelihood of cancellations in the winter, although the effect is less pronounced compared to other seasons.
Average price per room has a slightly stronger positive correlation (0.36), indicating that higher-priced rooms may have a higher chance of being cancelled during the winter season.

Conclusion

  1. This analysis suggests that a guest’s previous cancellation history could be a reliable predictor of their future booking behavior. Hotels could use this information to refine their overbooking strategies, potentially allocating more flexible cancellation policies to first-time guests or those without a history of cancellations. Conversely, they might implement stricter policies or require deposits from guests with a history of cancellations to mitigate potential losses.
  2. Cancellation rate for bookings without special requests: 8545 / (8545 + 11232) ≈ 43.2%. Cancellation rate for bookings with special requests: 3340 / (3340 + 13158) ≈ 20.2%. There’s a clear difference in cancellation rates, with bookings having special requests showing a significantly lower cancellation rate.
  3. Both longer lead times and higher room prices increase the likelihood of cancellations. The t-test shows that the average lead time for canceled bookings (139.2 days) is significantly higher than for non-canceled bookings (58.9 days), with a confidence interval of [78.3, 82.2] days and verage room price for canceled bookings (110.6 units) is higher than for non-canceled bookings (99.9 units), with a confidence interval of [9.92, 11.39]. These factors together can significantly affect revenue forecasts, occupancy rates, and operational efficiency.
  4. Relationships between variables varied greatly between seasons. Lead time was found to have a stronger correlation with booking cancellations in the fall compared to the winter. This suggests that in the fall, customers with longer lead times (those who book further in advance) are more likely to cancel their bookings. In contrast, during the winter, the correlation is weaker, possibly due to the shorter, more concentrated vacation period around holidays, where bookings are made with more certainty and less time to cancel. This highlights how seasonal trends influence customer behavior in relation to cancellations.

Closing Thoughts

Our initial research question aimed to identify the factors that most significantly influenced hotel booking cancellations. After conducting EDA, it became clear that some factors, such as lead time, special requests and meal plan type, played a far more significant role in predicting cancellations than others. The analysis also revealed that seasonality interacted with these variables, with certain seasons showing stronger correlations between cancellations and lead time. As a result, the question evolved to focus more specifically on the interaction between these high-impact factors. Based on our analysis, we can begin to outline an answer to the research question. Lead time emerged as the most critical factor, with longer lead times strongly correlated with cancellations, particularly during specific months of the year. Customers booking well in advance were more likely to cancel, especially in the months leading up to busy travel seasons. Special requests also appeared to reduce the likelihood of cancellations, suggesting that these customers are more committed to their bookings. These findings indicate that focusing on lead time management and offering incentives to secure early-booking customers could be key strategies to reduce cancellations.